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Abstract 

We present a distributed-memory library for computations with dense structured matrices. A matrix 
is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with 
low numerical rank. Here, we use Hierarchically Semi-Separable representations (HSS). Such matrices 
appear in many applications, e.g., finite element methods, boundary element methods, etc. Exploiting 
this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, 
which are the two main building blocks of matrix computations. The compression algorithm that we 
use, that computes the HSS form of an input dense matrix, relies on randomized sampling with a novel 
adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the 
parallelization of structured matrix-vector product, structured factorization and solution routines. The 
efficiency of the approach is demonstrated on large problems from different academic and industrial 
applications, on up to 8,000 cores. 

This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) 
software package for computations with sparse and dense structured matrices. Hence, although useful on 
their own right, the routines also represent a step in the direction of a distributed-memory sparse solver. 


1 Introduction 

1.1 Background 

Many applications involve dense matrix computations with structured (or low-rank, or data-sparse) matrices, 
i.e., matrices that are compressible in some sense. In some applications, these matrices are rank-deficient or 
nearly so and can be readily compressed exactly or approximately using such algorithms as SVD, CUR [22], 
or a rank-revealing factorization. In many applications, the matrix is not (nearly) singular, but contains 
low-rank blocks, typically the blocks away from the main diagonal. Such matrices appear in the boundary 
element methods and finite element methods [9, 18] for solving partial differential equations (PDEs). In 
the discretized matrices, the low-rank off-diagonal blocks arise because the associated Green’s functions are 
smooth. The low-rank structured matrices also arise in applications that involve Toeplitz matrices (e.g., 
quantum chemistry, time-series analysis, queuing theory...), etc. Identifying and compressing these low- 
rank blocks is the key to reducing the storage and computational costs of many matrix operations, such as 
solving linear systems, performing matrix-vector products, and computing eigenvalues. 

Different algebraic low-rank representations have been proposed in the literature. In particular, %- 
matrices, "H^-matrices, and Hierarchically Semi-Separable (HSS) matrices have been widely studied. It is 
not our goal to review these techniques and we recommend the references listed in [36] for an overview. Some 
of these low-rank representations have been successfully implemented in software packages, but we are not 
aware of many publicly-available parallel libraries. In previous works, two codes based on the multifrontal 
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method for solving sparse linear systems embedded HSS algorithms: Hsolver, a distributed-memory geometric 
code for finite-difference discretizations on regular meshes [31], and StruMF, a sequential algebraic code [25]. 
The other software packages that use low-rank approximation techniques include: Hlib (for %- and Ti.^- 
matrices) [7], and MUMPS (sparse direct solver with Block Low-Rank approximation techniques) [2, 1]. 

1.2 Contributions of this work 

Despite a large number of papers on the asymptotically low complexity of HSS-based representation and 
operations, the methods are mostly inaccessible to the high-performance computing community due to 
the lack of parallel software. Our work aims at providing a scalable package that can be used in large- 
scale applications. We have developed STRUMPACK - STRUctured Matrices PACKage - a package for 
computations with sparse and dense matrices. It combines HSS representations with a randomized sampling 
technique, which was not the case in our previous contributions. STRUMPACK has presently two main 
components: a distributed-memory dense matrix computations package and a shared-memory sparse direct 
solver. In this paper, we present the distributed-memory package. It is implemented using MPI and contains 
the following features: 

• Compression into HSS form using randomized sampling. 

• Solving linear systems using ULV-like factorization and solution. 

• Computing HSS matrix-vector products. 

STRUMPACK is a general package that does not make any assumption on the input matrix. It is 
algebraic (as opposed to geometric) and can work on any number of MPI processes. Our previously-developed 
geometric solver Hsolver also employs HSS compression and factorization kernels that can be used in a 
standalone way for dense matrices [32]. However, Hsolver is limited in usability - it is a simplified code that 
works only with power-of-two number of processes and in single precision complex arithmetic. STRUMPACK 
does not have these limitations and, as presented here, it employs more recent algorithm advances (e.g., HSS 
combined with randomized sampling). It typically outperforms Hsolver, as we demonstrate in Section 4.6. 

In summary, the contributions of this work are the following: 

• The library we present here (part of the STRUMPACK package) is the first randomized, distributed- 
memory, general purpose package for HSS matrix operations. It can use any number of MPI processes, 
not restricted to power-of-two as in Hsolver. It is up to 6x faster than the dense kernels used in 
Hsolver [32]. 

• We developed a flexible task-to-process mapping algorithm to accommodate non-uniform hierarchical 
matrix partitionings and unbalanced HSS trees. Therefore, the algorithms herein are fast for a wide 
range of applications (see Sections 3.1 and 4.2). 

• We developed an efficient parallel adaptive sampling method that is essential for problems with rank 
structures that cannot be estimated a priori. This permits the solver to be used in a black-box fashion 
and increases its usability (see Section 3.3). 

• We evaluated our algorithms for large-scale problems from a wide range of different academic and 
industrial applications, using large number of cores. 

The rest of the paper is organized as follows. In Section 2, we review HSS techniques and the different 
ingredients of the HSS framework (HSS compression, ULV factorization and solution). In Section 3, we 
present our parallelization approach. We show how tasks are mapped and parallelized, we present our 
adaptive sampling mechanism, and we describe the communication features of our compression algorithm 
(number of messages and volume of communication). In Section 4, we report on results using matrices from 
different applications. We show how HSS algorithms behave for different applications, and we present weak 
and strong scaling experiment to assess the performance of our code. 
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2 Background on Hierarchically Semi-Separable matrices 

We briefly introduce Hierarchically Semi-Separable (HSS) matrices. We mostly follow the notation used by 
Martinsson [23]. We recommend [36] for more theoretical aspects, and [37] for the use of HSS techniques for 
solving Toeplitz problems. The following references are works related to solving sparse linear systems using 
HSS techniques: [35] (geometric setting, serial code), [33] (algebraic setting, serial code), [34] (algebraic 
setting, HSS techniques combined with randomized sampling, serial code), and [31] (geometric setting, 
distributed-memory code). 

2.1 Representation 

HSS representations rely on a cluster tree that defines a hierarchical clustering (or partitioning) of the index 
set [l,n], where n is the number of rows and columns of the matrix we consider. A cluster tree is such that 
every node r is associated with an interval It- The root node is associated with the interval [l,n], and, for 
every node r of the tree with children vi and 1 ^ 2 , we have It = Ivi U 1 ^ 2 , and I^^ I^Iv 2 =0 (for simplicity we 
only consider binary trees, but the generalization is straightforward). The numbering of the nodes is done 
top-down; the root node is 0, and a node numbered i has children numbered 2i + l and 2i + 2. In Figure 1(a), 
we show a possible cluster tree of [l,n]. In this example, the children of 2 are 5 and 6. 

[l,n] 


C/3,1/3 C/4,1/4 C/5,14 C/6,14 

[ 14 ] [f + l.f] [f + l-x] + ^3 C/4 D, 

(a) Cluster tree. (b) Three-level HSS tree. 




Figure 1: Cluster tree and HSS tree associated with the example in Section 2.1. 


Any n X n matrix A can be written into HSS form as follows: 

1. Considering a 2 x 2 partitioning of A, i.e., a two-level cluster tree (one root node and two leaves), the 
off-diagonal blocks of A are decomposed into an “SVD-like” UBV form: 


Ai.i Ai,2l ^ r c/i C/“sHi,2l4‘’'®*’ 
A2,1 ^ 2 , 2 ] C/2‘’*®B2,i1/“®* £>2 


( 1 ) 


The D matrices are simply the diagonal blocks of A and the U^B^V matrices are called generators. 
We explain the reason of the superscript in the third point. This decomposition holds for any 

matrix A, but it is useful in practice (i.e., it reduces storage requirements and can be used for fast 
operations with A), only if the off-diagonal blocks of A are low-rank. As mentioned in the introduction, 
this happens in many applications such as boundary element and finite element methods. When the 
off-diagonal blocks are low-rank, the U matrices are “tall and skinny”, the B matrices are small and 
square or nearly square, and the V* matrices are “short and wide”, the aspect ratio depending on the 
ranks. The C/, B, V matrices are computed using a rank-revealing factorization; we elaborate on this 
in the next sections. 

Note that in situations where the off-diagonal blocks have large ranks, we may wish to approximate 
them instead of computing their exact UBV decomposition; we show how in Section 2.2. In this case, 
Equation (1) provides us with an approximation of A that can be used, e.g., for preconditioning. 
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Note that this partitioning of the matrix corresponds to partitioning [l,n] as [l,n] = /i U / 2 . 


2. Recursively, i.e., considering a three-level cluster tree, the off-diagonal blocks of the diagonal blocks of 
A are also decomposed into U, B, V form, and so on. After another stage of recursion: 


D3 C/3“«R3.4R4“®* 

C/4“«R4.3VJ‘«* Di 




( 2 ) 


This partitioning corresponds to h = I 3 U I 4 and I 2 = hU Iq. 


3. There is a recursive relation between the generators appearing at different stages of recursions (which 
is the specificity of HSS and "H^-matrices over the other classes of "H-matrices, and explains the use of 
the superscript): 


= 


fr/big 

0 


0 


Ui 





0 


Vi 


Thus, 


[/4^'®R4.3VJ‘®* D 4 


rc/“® 

0 


0 

5 

0 

Uf\ 

C/2H2.1H1* 

0 



0 


n 1 

^ 3 

0 


UlBi^2V2* 

1 

0 

0 

crq 

* 

1_ 


D 5 


This property is called the nested basis property. 


(3) 


(4) 


In general, the HSS representation of A follows the structure of the cluster tree: 

• For each leaf node r, the corresponding diagonal block Dr = A^r^Ir) is left untouched (uncompressed, 
or “full-rank”). 


• For each non-leaf node r with children ui and 1 ^ 2 , the corresponding off-diagonal blocks 
and are represented (exactly or approximately) by:^ 


r/big n T/big* 
^ 1 ^ 1 , 1^2 1^2 


Furthermore, the hierarchical relation holds, i.e., basis are nested: 

0 


[/big ^ 


■[/big 

0 


[/big 
1^2 . 


Ur 


yhig ^ 


ybig 

'^121 

0 


Rbig 
1^2 . 


Vr 


(5) 


( 6 ) 


Note that we never have to store or form explicitly the “big” matrices at non-leaf nodes. Indeed, U 
at node r is given by Ur and the Hbig matrices at its children vi and V 2 ^ which are themselves given by 
looking at the grand-children of t, and so on. At leaf nodes, = U. In Figure 1(b), we show the tree 
corresponding to the previous example. 

It is important to mention that the order of the rows and columns of matrix A matters. If A is shuffled 
randomly, the low-rank property is lost. In practice, matrices from real-life applications are often generated 
following an order that preserves the low-rank property. This was the case with all the matrices that we use 
in Section 4. This point is developed in the literature [25, 31, I]. 

In the rest of this section, we show how to obtain the HSS form of a matrix using randomized sampling. 
Then, we describe the different operations that can be performed with an HSS representation: matrix-vector 
product, ULV factorization (a specialized LU factorization), and triangular solution. 

^In the subsequent sections, when the context is clear, we will use equal sign instead of approximately equal. 
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2.2 Compression with randomized sampling 

Compression, i.e., construction of the HSS form of a matrix, is the most important algorithm of the HSS 
framework. Once the matrix is compressed, fast operations, such as a specialized factorization or specialized 
matrix-vectors products can be performed. We provide algorithmic details in the following sections. 

The HSS compression algorithm we use is based on randomized sampling, which is essentially done by 
multiplying the input matrix with a set of random vectors. It was introduced by Martinsson [23] and was 
also used by Xia et al. in a sparse multifrontal solver [34] and for algorithms for Toeplitz matrices [37]. 
The main advantage of this approach is that it does not require explicit access to all the entries of A; it 
only requires a matrix-vector product routine and access to selected elements of A. Therefore, A does not 
need to be explicitly formed, which saves memory, and the algorithm can benefit from an application-specific 
matrix-vector product. Furthermore, using randomized sampling simplifies the embedding of HSS kernels 
within a sparse solver [34]. This is the other component of the STRUMPACK project and is described in [16]. 

Using a classical matrix-vector product, the complexity of the compression operation is 0{rn^) 

with r the maximum rank found during the compression, that we refer to as the HSS rank of A. In many 
applications, r is much smaller than n. For example, it can be a small constant (e.g., 2D Poisson problems), 
or grow slowly with n (e.g., logn for 2D Helmholtz or for 3D Helmholtz problems) [33]. If a fast 
(typically 0{n)) matrix-vector product is available, the complexity drops to 0{r^n). Most of the floating¬ 
point operations happen when computing the samples, i.e., in the matrix-vector product. In a parallel 
setting, this helps load balancing in situations where very different ranks appear in different branches of the 
HSS tree. 

We briefly recall how HSS compression without randomized sampling works, as described in [36]. The 
main property that we use is that, at each node r, the off-diagonal row blocks and column blocks A(Ir, Io\It) 
and A(/o \ It, It) are low-rank, denoting /q = [l,n]. These blocks are referred to as the strip row Hankel 
Mocks and strip column Hankel Mocks of A in [9]. Consider row blocks. We traverse the tree following a 
postorder, from the leaf nodes up to the root node. At a leaf node I, A{Ii,Iq \ Ii) is low rank and we can 
find a basis Ui for the rows, by using a rank revealing factorization: A{Ii,Io \ Ii) = UiXi. At the parent 
node p, we wish to compress A(Ip, Iq \ Ip). However compressing this block directly is potentially expensive 
and does not make use of the nested basis property. Instead, we use: 


A{Ip,Io\Ip) 


'A{Ip„Io\Ip) 


'Up,Xp,{:,Io\Ip) 


Up, 

0 ■ 

'Xp,{:,Io\Ip) 

A{Ip„Io\Ip)_ 


Up,Xp,{:,Io\Ip)_ 


0 


Xp,{:,Io\Ip)_ 


(7) 


Our objective is to compress A{Ip, Iq \ Ip) as A{Ip, Iq \ Ip) = Up'^Xp] using the above equation, we get 


and Xp by computing a rank revealing factorization of 


(■) ^ instead of compressing A(/„, Iq \ Ip) 

\ tpjj 


directly. This process is illustrated in Figure 2. Column blocks are compressed in a similar way to obtain 
the V generators, using the X obtained during the compression of row blocks. 

The randomized compression algorithm follows a similar process, except that it relies on samples of the 
input matrix instead of accessing the matrix directly. For now we suppose that the maximum rank r is known 
a priori. We relax this assumption in Section 3.3. Let i?’’ and R'^ he n x d tall and skinny random matrices 
with d = r + p columns, where p is a small oversampling parameter (Martinsson recommends p = 10). Let 
S''’ = AR^ and S° = A*R‘^ be samples for the row and column bases of A respectively. For a non-leaf node 
T with children vi and r' 2 , let Dt be defined as 


Dt 


n A 

^ 1 ^ 2 , ^U'2 


If {ti, r 2 ,..., Tg} are the nodes at level £ of the HSS tree, then 


D^^'> = diag(D^,, 11.^2, • ■ •, A'rJ 


is an n X n block diagonal matrix. The main idea of the randomized sampling algorithm is to construct a 
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(c) Before compression of parent: 


AiIp,Io\Ip) 


0 ■ 


0 



X,^{:,Io\Ip)_ 


(d) After compression of parent: 


AiIp,Io\Ip) 


0 

0 Ui,2 


UpXp 


Figure 2: Compression process without randomized sampling, at two child nodes i^i and 1^2 and their parent 
T. Full blocks are off-diagonal blocks to be compressed. Shaded blocks are that are left untouched. 


row sample matrix for each level of the tree as 

= (^ - = S^- 

This row sample matrix captures the action of a product of the block off-diagonal part of A with a set 
of random vectors R^. It is exactly this block off-diagonal part that needs to be compressed using low-rank 
approximation to obtain the HSS generators. Similarly, we compute a sample matrix using and to 
capture the column space of the off-diagonal blocks. 

A central component of the randomized sampling algorithm is the Interpolative Decomposition (ID) [II]. 
The ID computes a factorization of a rank-A: m x n matrix Y by expressing the columns of Y as linear 
combinations of a subset of columns of Y: 

[X, J] = ID(y), s.t. Y = Y{:, J)X where Y is m x k and X is k x n 

A compression tolerance e can be added as a parameter: 

[Ai, J] = ID(y, £), s.t. Y ~ Y{:,J)X where Y ism xk' and X is k' x n 

where the numerical rank k' < k. The ID can be computed using, for example, a QR factorization with 
column pivoting [8, 28] 

Y = Q RIV~^ (11: permutation matrix representing column pivoting) 

= Q [Ri R 2 ] n-i {Ri-.kx k) 

= (Qi?i)([/ i?r^i?2]n-i) 

= Y (:, J) X {Q Ri- first columns of pivoted Y) 
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A consequence of using Interpolative Decomposition is that is a submatrix of the 

original matrix A. Furthermore, it also leads to a special structure for the Ur and Vr generators: 

Ur=K and Vr=n^r 

where Ur and Vr have respective column ranks and r^, 11^ and 11^ are permutation matrices and the 
/’s are the Identity matrices, one is of order r^, the other of order r%. This structure is exploited in the 
factorization, as shown in Section 2.4, and it allows for faster operations with the generators. From a memory 
viewpoint, we only need to store the E matrices, and the permutation matrices 11 are represented by a single 
vector. A remarkable consequence is that when the block we want to compress is full-rank, the generators 
have the degenerate form Ur = n^J, and therefore they can be stored at very low cost (only the permutation 
information needs to be stored). 

The compression algorithm works as follows: 

1. Generate BA and random n x d matrices. 

2. Compute the samples = AR^ and 5° = A* R'^. 

3. Traverse the tree in topological order (i.e., children before parents): at each node, 

(a) Construct local samples. 

(b) Compute generators using Interpolative Decomposition. 

(c) Update samples and random vectors to make the construction of local samples faster at subsequent 
nodes. 

The detailed algorithm is presented in Algorithm I. Note that in the serial case, the topological order that 
we follow is simply a postordering of the HSS tree. However, in the parallel case, we follow a more general 
topological order, as described in Section 3.1. Note that the Interpolative Decomposition (step (3)(b), line 10 
in the algorithm) is the step where the user-given threshold e is used. The QR factorization with column 
pivoting stops when < e. 

2.3 Matrix-vector product 

Once a matrix is compressed into an HSS form, matrix-vector products can be computed in 0{rn), thus 
typically faster than using a classical 0{n‘^) product. However, the compression cost is 0{rn‘^) using a 
standard non-randomized algorithm, or using a randomized algorithm based on samples computed with 
standard matrix-vector products; therefore, it is amortized only when multiple products are computed, either 
successively or with blocks of vectors. This is the case for example in iterative linear solvers or eigensolvers. 
The HSS matrix-vector algorithm consists of two traversals of the HSS tree, as shown in Algorithm 2. The 
first traversal accumulates the actions of the V generators, while the other traversal uses the U generators 
as well as the Dr matrices. 

2.4 ULV-like factorization 

A matrix in HSS form can be factored using a special form of factorization called ULV factorization [10]. 
Then, the factored form can be used to obtain the solution to the linear system. We now describe the 
factorization algorithm, using a two-stage HSS example (i.e., a three-level tree) to aid exposition. 

In the original ULV factorization, fast orthogonal transformations are used to eliminate 0{n — r) un¬ 
knowns; the remaining 0(r} unknowns are eliminated using a standard LU factorization. The factorization 
we use does not use orthogonal transformations but instead it exploits the special structure of the HSS 
generators that comes from the Interpolative Decomposition. Algorithm 3 shows the complete ULV factor¬ 
ization procedure. In the following we explain how it works, starting from the one-stage HSS form (I), i.e., 
a two-level tree. 
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Algorithm 1: Computing the HSS representation of an unsymmetric matrix. 


Data: d = r + 10 with r an upper bound for the rank of A € 

S'' = AH'- and 5'^= = A*R‘' with {S'C S^, i?C R^} € 

A tree on the index vector [1, n] with an index set It at each node t 
R esult: Basis matrices defining the HSS matrix: 

Dt at the leaves, Ur, K at all nodes except the root 
at non-leaves for all children combinations 

1 foreach node r in topological order (bottom-up traversal) do 

2 if node t is a leaf then 

3 Dt=A{ItJt) 

4 SI, = S''{It,-.)-DtR'-{It,--) S(,, = S-{It.-.)-D(R%It,-.) 

5 else 

6 I Let vi and V 2 be the two children of node r 


[or _ D pr 

*^loc Qr _ U pr 
1 ^ 1/2 

end 

[iUTr,J(]=m{{su*) 

qr _ qr ( jr 

if node r is a leaf then 

i; = iriJ() 

else 


.^2. 


= A{i:,,nA 


qc _ D* pc 

qc _ ^l/2,i^l^l'2 

"^loc QC p* pc 

Pl/2 ^ 1 ^ 1 , 1 ^ 2 -^ 1^1 


[iVTr,J(] =miisu*) 

S) = SUJf ,:) 

R) = iUT)*R%Ir,:) 

n = w) 


K = {UrT 

I) = [i(, Jh 


19 end 


r 1 1 \-E'' /] T 

Recall that each U generator has the special structure Ur = H^ . Define Dt- = n ■ Then 

h/j. 1 U 

the transformation VItUt = j introduces a zero block on the top, where / is of order r(. Now consider the 
one-stage HSS decomposition (as in Equation (1)): 


A _ Aiq Ai^ 2 _ Di UiBi^2Vf 

^ 2,1 ^ 2^2 U2B2.\Vf D 2 


Applying Di and D 2 , we get: 


Di 0 
0 D 2 


B2,iVf 


0 

BiaVf 


At each node r, we partition HV = HtDt into the top (t) and bottom (b) parts, HV = where Wr-b 

^T-.b 





Algorithm 2: HSS matrix-vector product, for a non-symmetric matrix. 

Data: HSS form: Dr (leaves), Ur, Vr (all nodes except root), and Si/ 2 , 1/1 (non-leaves). 

Right-hand side x (one or more columns). 

Result: b = Ax. 

1 foreach node r in topological order (bottom-up traversal) do 

2 if node t is a leaf then 

3 I Vr = Vfx{Ir,-.) 

4 else 

5 Vr = Vr* 

6 end 

7 end 

8 Zr = 0 for root node 

9 foreach node r in reverse topological order (top-down traversal) do 

10 if node t is a leaf then 

11 I b(^Ir, •) — UrZr “t” DrXi^lr, •) 

12 else 

13 

14 I end 

15 end 




has r) rows, and we perform an LQ decomposition of HV;t, Wr-t = [Lr 0] Qr- Then, 

■ [Li 0] 0 

'^1 ] a\Q*i 1 = 

H 2 J [ O 2 J 0 [L 2 0] 

[B^.iVfQl W 2 .bQ *2 


Li 0 0 

Wi-bQl.t Wi.bQl., Bi^2V2*Q*2,t Bi^2VfQ*.,\ 

0 0 L 2 

B2.iVfQl.t B2,iVfQl.j, W2-bQ*2.,t 


W2;bQ*2,b 


Implicitly, if we swap block rows (and columns) corresponding to the {l;b} and {2;t} parts, denoted by a 

f/ I 

permutation matrix F i;h<->. 2 ;t = / q the above transformation can be written in the ULV factored form: 


-1 i l;b-i-> 2 ;t ’ 


Li 

0 L 2 

^2,1 Li 2 


where L 2 1 = Ti 2 = > and Dq is the reduced submatrix 

lB 2 ,lViWi.tj ’ I A' 2 -bQ 2 -t \ 


„ de^f Wi-bQX-b Bi^2VfQ*2.fj de^f 
° “ [B2,lVfQ\.^^ W2-,bQ*2.^b 


D, B,,2V2*Q*2,b 
B 2 ,iVfQl., D 2 


This is how the name “ULV-factorization” came from [10]. In the original form, both “U” and “V” 
transformations are orthogonal. But here, since Hr has the special structure stemming from the Interpolative 
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Decomposition, it may not be orthogonal. Therefore, we refer to it as a “ULV-like” factorization. Note that 
factorization (9) is not used in the solution procedure; instead, it is the transformation (8) that is actually 
used, as we show in the next section. This transformation is also used in the sparse factorization by Xia [34]. 

With the “L” form above, the unknowns corresponding to Li and L 2 can be eliminated using a regular 
forward substitution. The reduced submatrix Dq corresponds to the 0{r) remaining unknowns, and is 
formed at the parent node, which is also a root node, where LU{Dq) is performed. This is illustrated in 
Figure 3, with Dq colored red. 



Figure 3: Illustration of the one-stage ULV factorization process. 


Note that in the one-stage case presented above, we have to perform LQ factorizations of two matrices 
with order n rows and columns, assuming r is small; therefore the cost is 0(n^). To bring the asymptotic cost 
down, we need more levels in the HSS tree. In the next step, we consider the two-stage HSS decomposition 
in Equation (4), i.e., a three-level tree. We assume that the two diagonal blocks (children I and 2) are 
already transformed into ULV form (9), via U = diag(U;]"^, Ug and V = diag((53, ( 54 , Q 5 , Qe)- 

The remaining uneliminated blocks are 


Di 




W4-,bQl,b . 


B6,5V,*Qlb 


BbfiViQU 

We-,bQl,b . 


( 10 ) 


The two transformations diag(r 23 , U 4 ) and diag(U 5 , Ug) can be respectively applied to the off-diagonal blocks 
(1,2) and (2,1) of the matrix A. Due to the nested basis property (see (3)), t/g'® and t/ 4 '® are already annihi¬ 
lated to 

similarly for the (2,1) block of A. 

At the parent nodes 1 and 2 , which are non-root nodes and have Ui and U 2 bases associated with them. 


Therefore, the only nonzero part of the (1,2) block of A is UiBi^ 2 V 2 


ViQ%t 

ViQlt ViQt,. 


we use the transformations UiUi 


0 

I 


and U 2 C /2 


0 

I 


to introduce the zero blocks. Then, we apply the 


above annihilation and transformation to the diagonal blocks Di and D 2 (see (10)), followed by the LQ 
decomposition of each top part. Eventually, at the root node, only 0{r) unknowns are left and a regular 
LU factorization (with pivoting if needed) is used. 

The two-stage transformation process can be written as follows: 




■/ 


^3 


Q3 




'I 

Oi 

rr 3 ;b-H- 4 ;t 1 

Q4 


Q4 


3\b-(r^4:t 


Qi 

I 




Qi 


p'T 


I 





Qg- 




Q2- 


r 


T 


L 3 

0 

Li 



(r 2 iL 4 , 3 )t 

(r 2 iL 3 , 4 )t 

Li 



0 

L 5 
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The algorithm is presented in Algorithm 3. The complexity is 0{r^n) [10, 34]. Notice that the output 
of the algorithm is, at each non-root node r, the Qr and Lr matrices that represent the ULV factors, 
but also the matrix WV = ^IrDr and the matrix Vt, that accumulates the actions of V bases and the Q 
transformations, as shown in lines 5 and 15 of the algorithm. Vr is conceptually similar to except it 
has only 0{r) rows, corresponding to the uneliminated variables. The matrices Wt and Vr are useful for the 
solution phase, as shown in the next section. 


Algorithm 3: ULV-like factorization of a non-symmetric matrix in HSS form. 

Data: HSS form: Dr (leaves), Ur, Vr (all nodes except root), and Si/ 2 , 1/1 (non-leaves). 

Result: ULV factors: Qr orthonormal, Lr lower triangular (all nodes except root). LU at root. 
Wr and Vr to be used in solution step. 


1 foreach node r in topological order (fine to coarse) do 
if node t is a non-leaf then 


4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


16 

17 

18 end 


Dr = 


L^i/i,i/2K/2;b 




if node r is not the root node then 

\K,;b ^0 

0 K2;6J 


v; = 


14 


end 
else 

I Vr=Vr 

end 

if node r is the root node then 

I [PrtLrjUrl = W (Dr) 

else 


Wr = VLrDr = 

LQ {Wr.,t) = [Lr 0] 
Vr = QrVr = 


'-Ef 

I 

Dr = 

'Wr-,i 

I 

0 


Wr-,b_ 


QT;t 

Qt'P 


Vr-,t 

Vr-b 


Dr = Wr:bQ*r.i, 


end 


2.5 Solution using ULV factorization 

The ULV-factored form (11) can be used to solve a linear system Ax = b. We still use the two-stage HSS 
example (three-level tree) to explain the solution procedure. 

Consider a partitioning of the right-hand side b and the solution vector x along the cluster tree: b = 


Xh,--)' 


h' 

XX--)_ 


p2 


and similarly, x = 


and the one-stage ULV factorization given in Equation ( 8 ). The 


solution X can be obtained by the following five steps: 


1. Transform the right-hand side: bi = Hi5i, and 62 = 142 ^ 2 ; 

2. Forward substitution: yi = Lf^bi-t, 2/2 = Lf^b 2 X, 
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3. Update right-hand side: 

bl-fi = hi-b — Wi-bQl-t 2/1 ~ ^l, 2 U 2 *( 52 ;t 2 / 2 , 

b2:b = b2-b — B2,lViQl.^ yi — W2;bQ2-,t 2/2; 


4. Triangular solution at root: xq = Uq ^Pq 



5. Orthogonally transform back to the original solution: xi = Qi 


2/1 

xo-,t 


, X 2 = Q2 


2/2 

Xo-,b 


Next, consider the two-stage ULV transformation given in Equation (11). Algorithm 4 shows the complete 
procedure, which follows a bottom-up traversal of the HSS tree. We first apply all the transformations 
involving O’s to the right-hand side b, to obtain b (line 10 in the Algorithm.) Then we obtain all the 
intermediate variable yr for the non-root node t via forward substitution (line 11 in the Algorithm). Now 
looking at the last block row of (11) involving Dq, we need the contributions from the children of the 
root node (nodes 1 and 2). For example, the intermediate solution yi coming from node 1 contributes 
to the terms Wi-^Qi-tUi and B 2 ^iV*Ql.fyi. Furthermore, there are contibutions coming from the grand 
children of the root node, i.e., nodes 3, 4, 5, and 6. For example, nodes 3 and 4 contribute via the term 


B2,iV,* 


v:qi. 


4;t 


v:qi 


4;& 


1 

7 


2/3 


Ql 


2/4 




.2/1. 


. In the general case (arbitrary number of levels), 


bo (updated right-hand side at root node) receives contributions from all the nodes in the tree, because the 
last block row of the L is full. In the algorithm, we accumulate these updates when going up the tree, as 
shown in lines 10 and 13 of the algorithm. We illustrate this in more detail in Appendix A. 

Finally, the intermediate solution involving y needs to be transformed back to the original solution x 
(line 21 in the Algorithm). The complexity of Algorithm 4 is 0{rn) [10, 34]. 


3 Distributed-memory parallelism 

In this section, we present our distributed-memory algorithms. We mostly focus on the implementation of 
the HSS compression algorithm, as this is the most complicated of all HSS operations but also the most 
critical for performance. In Section 3.3, we present a novel parallel adaptive sampling mechanism. 

3.1 Task mapping 

The HSS tree presented in Section 2.1 is a task graph and data-dependency graph for all the different 
operations: compression, factorization, solution, and product. The tree structure allows for two levels of 
parallelism. Tree parallelism comes from the fact that nodes lying on different branches of the tree can 
be processed in parallel, independently of one another. Node parallelism consists in assigning a node of 
the tree to multiple processes. We enforce node parallelism by using parallel kernels from PBLAS [121 and 
ScaLAPACK [6]. 

We rely on a static mapping technique to assign tasks to different processes. We use the idea of the 
proportional mapping by Pothen and Sun [26], which is popular for mapping tasks along the elimination tree 
of sparse factorizations. The mapping process consists in a top-down traversal of the tree. All the processes 
are assigned to work on the root node, because this is the last task to be executed during a bottom-up 
traversal (e.g., compression, factorization) and the first task to be executed during a top-down traversal 
(e.g., matrix-vector product and triangular solution). Then, for every node in the tree, the list of processes 
working at that node is split among its children, proportionally to the weights (determined according to 
a given metric) of the subtrees rooted at these children. Consider a parent node / in the tree with ncf 
children. Let p/ be the number of processes working at that node and Wi be the load of the subtree rooted 
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Algorithm 4: Solution of a linear system Ax = b after ULV-like factorization, for a non-symmetric 
matrix. 


Data: ULV factors: Qt orthonormal, Lr lower triangular (all nodes except root). LU at root. 
Result: x, solution of Ax = b. 

1 foreach node r in topological order (bottom-up traversal) do 

2 if node t is a non-leaf then 


else 

I br = b{Ir,-.) 

end 

if node t is the root node then 

I X, = U-^Lf^Prbr = 


br = ^rbr = 


Vt — h/r ^r;t 


Ufbr = 


if node t is a non-leaf then 

z, = y; Kl + % vr 


Zt = Vf t yr 


18 end 

19 foreach node r in reverse topological order (top-down traversal) do 

20 if node t is a non-leaf then 

I „ _/o* r 1 


= Qt 


Xl'2 Qiy^ 


22 else 


xQt j ■) — Xt 


24 I end 

25 end 


at a child i. The number of processes given to node i is 


Pi = 


W, 

Ejair/”' 


This procedure is applied in a recursive fashion to all the children of /; the recursion stops when leaf nodes 
are reached or entire subtrees are mapped onto single processes, which happens because the number of nodes 
in the tree is commonly much larger than the number of processes. 

The usual metric used at each step of the mapping is the workload of each subtree. However, in our 
case, we cannot use this because we do not know in advance the cost of processing a node since it depends 
on the ranks found at that node. Instead, we use the size of the interval It associated with each node t. 
The idea is that, at leaf nodes, the compression cost (computing local samples and performing Interpolative 
Decomposition) is proportional to the size of the interval. If the ranks found at different branches of 
the tree are balanced, workloads will be balanced. Otherwise, workloads might be unbalanced, leading to 


13 





poorer performance of the compression process. However, after the compression is done, the tree can be 
remapped using the rank information, which can be useful for subsequent operations (factorization, etc.) 
or for improving compression times for different problems from the same application. We illustrate this in 
Section 4.2. 

An interesting property of the proportional mapping is that the traversal of every process (i.e., the set 
of tasks that this process executes and the order in which those are processed) is fully known in advance. 
Indeed, every process is in charge of a sequential subtree and takes part in the computation of the parallel 
nodes in the path between that subtree and the root of the elimination tree; this defines a single possible 
traversal. Denoting by i the root of the sequential subtree mapped on a given process, the traversal followed 
by that process consists of a postorder traversal of the subtree rooted at i followed by the path from i to the 
root node. This makes the code easier to write. 

As we just saw, a node of the HSS tree can be mapped onto several processes. Within a node, our choice 
is to perform all the arithmetic operations with PEL AS and ScaLAPACK. All the matrices that we handle 
are distributed following a 2D block-cyclic scheme and each node of the HSS tree is associated with a 2D 
grid of processes that handle the computations. We typically try to make the grid as square as possible, 
as advised in the ScaLAPACK documentation [6], but the code can accommodate any kind for grid. For 
example, if 32 processes work at a node, our grid has = 5 rows and [32/5] = 6 columns. In this 

example, 32 — 6 x 5 = 2 processes stay idle at that node. However, it does not mean these processes are idle 
throughout the whole computation; they are idle at that node, but can be active at ancestors or descendants 
of that node. We illustrate this situation in Figure 4. In this example, node 7 is mapped on processes Pq to 
Pi, but Pi is out of the 2D grid associated with node 7 and is thus idle at that node. However it is active 
at nodes 4 and 6 (descendants of 7) and 15 (ancestor of 7). At a given node mapped on P processes, the 
associated grid is Pr x Pc and there are at most [V^J — 1 idle processes. 



Figure 4: Proportional mapping of an HSS tree with 9 processes and uniform weights. Every node is 
associated with a 2D grid of processes and, sometimes, a few idle processes. 


3.2 Parallel compression 

We provide some details about our implementation of the parallel HSS construction (compression) algorithm. 
The first stage of the compression algorithm is to generate random vectors. In STRUMPACK, different 
generators can be used: the legacy rand C function, or advanced generators from the C++11 standard, like 
the Mersenne Twister [24]. They can be combined with a postprocessing that enforces certain distribution 
of random numbers, e.g., uniform or normal. 

The second stage is to generate the samples S’’’ = AR'' and S'^ = A*R‘^. For this, the compression 
algorithm needs either access to a user-given matrix-vector product or explicit access to the whole matrix A. 
We require the input matrix to be distributed in 2D block-cyclic form and we use the PBLAS matrix-matrix 
product PxGEMM to compute the product. 
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The third stage is a topological traversal of the tree, where at each node, a local sample is formed then 
compressed and updated. To form the sample we need access to some selected elements of the matrix. For 
this, the compression algorithm needs either access to a user-given routine that provides selected elements 
or explicit access to the whole matrix A. If the input matrix A is explicitly given (in 2D block-cyclic 
form), we distribute it so that at each stage of the compression, a process can extract selected elements 
without communicating with processes working at other nodes of the HSS tree. This is done by traversing 
the tree following a serialized postorder (i.e., all the processes traverse the whole HSS tree synchronously). 
At each node, a piece of the original matrix (shared by all the processes) is redistributed to the subset of 
processes that work at that node. The diagonal blocks of A correspond to leaves of the HSS tree and are 
redistributed to the processes working at these nodes, so that they can extract a diagonal block without 
communicating with processes mapped at other nodes. Similarly, the off-diagonal blocks are also distributed 
along the mapping of the tree, so that the and matrices at non-leaf nodes can be extracted 

without communication. For each block, we rely on a 2D block-cyclic distribution using the process grid 
associated with the corresponding node. 

We provide an example in Figure 5, corresponding to the mapping in Figure 4. Consider node 1. The 
first step in the compression at node 1 is to extract Di = A(/i,/i) from the input matrix. These entries are 
distributed on Pq and Pi and readily available. Then, at node 3, which is mapped on Pq, Pi and P 2 , matrix 
Bi ^2 = A(/[,/|) is extracted by selecting some rows and columns of A{Ii,l 2 ). A{Ii,l 2 ) is distributed on 
P 07 Pi and P 2 , therefore the extraction can be done without communicating with processes working at other 
nodes. 


Po Pi 

P 0 P 1 P 2 

Po Pi „ 
-I-P4 

P2 P3 

Po Pi P2 

P3 P4 P5 

Pe P7 Pg 

P0P1P2 

P2 

PPr 

P2 P3 

P3 

P3 P4 

P3 P4 

P4 

Po Pi P2 

P3 P4 P5 

Pe P7 Pg 

P5 

P5 Pe 

P5 Pe 

P7 Pg 

P5 Pe 

Pe 

P5 Pe 

P7 Pg 

P7 

P7 Pg 

P7 Pg 

Pg 


Figure 5: Distribution of the input matrix conforming to the mapping in Figure 4. 

After building random vectors, computing samples, and distributing the input matrix, the postorder 
traversal starts. Serial subtrees (subtrees mapped on one process) are processed by a sequential compression 
routine that relies on BLAS and LAPACK kernels (which is usually better than using PBLAS or ScaLAPACK 
kernels serially). Then, parallel nodes are processed using PBLAS and ScaLAPACK operations. The main 
computational kernels are matrix-matrix products (performed with PBLAS PxGEMM) and the Interpolative 
Decomposition procedure described previously. For the latter, we explored two options: 

1. Modifying the xGEQPS and PxGEQPF from LAPACK and ScaLAPACK respectively. These routines 
perform a QR factorization with column pivoting but they compute the full factorization. We modified 
them to embed our compression threshold e. The factorization stops when the norm of the pivot 


15 




column becomes too small, i.e., < e, with R the partial R factor. The number of columns actually 

eliminated is the e-rank of the block to be compressed. 

2. Implementing a Modihed Gram-Schmidt (MGS) algorithm with column pivoting. The parallel imple¬ 
mentation uses 2D block-cyclic operations. A similar version was used in Hsolver [32]. 

In a parallel setting, we have not observed much difference in performance between the two options. In a 
serial setting, the modified xGEQPS routine, which uses a BLASS implementation, is typically two to three 
times faster than our BLAS2 MGS implementation. 

3.3 Adaptive sampling mechanism 

The algorithm in Section 2.2 assumes that the HSS rank r of the input matrix is known, so that the number 
of sample vectors d (number of columns of R^ and is chosen to be a tight upper bound of r. Indeed, 
d needs to be larger than r to get a stable compression, but it also needs to be not too large, because the 
sampling process requires 0 {dn‘^) operations and can dominate the other parts of the compression stage. 

In practice, r is rarely known. For some specific applications, we have a rough idea of its value, as 
described in Section 2.2. In order to get a more black-box compression process, it is important to design an 
adaptive sampling mechanism. This is mentioned in [23, 37] but neither an algorithm nor an implementation 
is described in detail. Here we explain our parallel adaptive sampling algorithm and implementation. The 
idea is to start with a low number of random vectors d, and whenever the rank found during Interpolative 
Decomposition is too large, d is increased. Instead of restarting the compression from scratch, we keep the 
generators that have been computed and the computation restarts at the node(s) where the rank was too 
large. 

In a serial setting, the sketch of the algorithm is the following. When the rank at a given node Tfau is too 
large, add new columns to i?’’ and i?'^, compute the new columns of S'^ and S‘^ with a product, and restart 
the postorder traversal: 

1. At nodes preceding Tfau, keep the generators {D, U, etc.) that were previously computed. Update 

and with new columns. 

2. At Tfaii, update and with new columns, and recompute the Interpolative Decomposition. If 
the rank is again too large, restart again, otherwise proceed to the next node. 

3. At nodes following TfaU, proceed as before. 

In this serial mechanism, a node can have three states: it can be UNTOUCHED if it has never been traversed 
before, PARTIALLY_COMPRESSED if the local samples have been computed but the rank obtained by Inter¬ 
polative Decomposition was found too high (i.e., the traversal restarts because of this node), or COMPRESSED 
if the generators have been successfully computed. There can be at most one PARTIALLY_COMPRESSED node 
in the tree. All the nodes that precede that node in the postorder are necessarily COMPRESSED, and all the 
nodes that follow that node in the postorder are necessarily UNTOUCHED. 

In a parallel execution, since we follow a parallel topological ordering of the tree instead of a serial 
postorder, we have different options. The choice we made is to implement a “late notification” mechanism. 
Whenever a process finds that the number of random vectors is not sufficient, it does not immediately notify 
the other processes. Instead, it simply invalidates the current node by leaving it UNTOUCHED. Then, whenever 
a parent node is activated, we check the state of its two children. If they are not both COMPRESSED, the 
parent node is left UNTOUCHED. Therefore, all the ancestors of the node that failed are left untouched. All 
the processes meet at the root node and can generate new random vectors, recompute samples, and restart 
the traversal. The difference with the serial case is that the tree can contain several PARTIALLY_COMPRESSED 
nodes. All the descendants of these PARTIALLY_CDMPRESSED (i.e., failed) nodes are compressed, and all their 
ancestors have been left UNTOUCHED. This adaptive sampling mechanism is shown in Algorithm 5. 

The main idea of this approach is that the different branches make as much progress as possible as long 
as the number of random vectors is sufficient. In the serial case, whenever a node fails, the traversal restarts 
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with more random vectors, meaning that the subsequent branches will be processed with more - and maybe 
unnecessary - random vectors. As a consequence, different executions on different numbers of processes will 
lead to slightly different ranks and HSS representations. 

Another choice, that we have not implemented, would be an “early notification” mechanism where pro¬ 
cesses are notified as early as possible that a node has failed somewhere in the tree. This is more complicated 
to implement and requires asynchronous communications to avoid barriers at each level or node of the tree. 
It is not clear that it would be significantly faster. 
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Algorithm 5: Processing a node r. 


1 if myid is in the 2D grid of r then 

2 if r non-leaf and not all children are COMPRESSED then 

3 state stays UNTOUCHED 

4 return 

5 end 

// Sampling 

6 if node is UNTOUCHED then 

7 I Extract D or - 812 ,- 621 , compute local samples and Sf 


Compute updates to the samples, e.g., — DRf^p^ or 


cr TD IDT 

^lupd ^2upd 

Qr D nr 

^2upd ^lupd 


// Interpolative Decomposition 
if node is PARTIALLY.COMPRESSED then 

// Merge updates into the samples and random vectors 

pr , ryr ryr or , or or rtc , pc oc , oc Qc 

-n ^ l^-a ^ , -rt ^ |^-rt ^ 

end 

if node is not COMPRESSED then 

Try Interpolative Decomposition of S''’ if rank too small then 
Throw away U and Ir 
Mark node as PARTIALLY.COMPRESSED 
return 


I Same for S" ^ S’’(/^,:), ^ S"(/^,:) 

else 

I SSp,^S2p,(/c,:) 

end 

// Update 

if node is UNTOUCHED then 
I i?'’ = P* X ..., = P* X ... 

else 


Kpd = v*x... 


Kpd = y* X 


if node is COMPRESSED and parent is UNTOUCHED then 
I // Merge updates into the samples and random vectors 


-8'’ ^ 


pr pr 
^ ^upd 


or , or or 


pc , Tic Tic QC , QC QC 

-rt ^ l^-rt ^ 


31 end 

32 Mark node as COMPRESSED 

33 else 

// myid is out of the 2D grid 

34 Receive state from Pt if state==COMPRESSED then 

35 I Receive ranks and indices from Pr- 

36 else 

37 I restart() 

38 end 

39 end 





3.4 Communication analysis 

We briefly analyze the amount of communication of our parallel compression algorithm. The analysis is 
similar to the one we derived previously on non-randomized algorithms [32]. We consider that each node of 
the HSS tree has the same rank r for its U and V generators; for some applications, a specific rank pattern 
can be used instead, as it is sometimes done in the literature [33, 34], but this is not our goal here. We also 
consider that, at the leaf nodes, the diagonal blocks have size 0{r). Finally, we assume that the number of 
processes is a power of 2, and the HSS tree is a complete binary tree. The pair [^messages, #words] is used 
to count the number of messages and the number of words transferred during a given operation, typically 
along the critical path. For example, a broadcast of w words among p processes is modeled as [logp, w logp]. 
This assumes that the broadcast follows a tree-based implementation; there are logp steps on the critical 
path (any branch of the tree) and w words are transferred at each step, yielding logp messages and w\ogp 
words. 

We denote n the size of the matrix and p the total number of processes. The parallel compression 
algorithm has three main steps: 


1. Matrix-matrix product to compute the samples. We use the PxGEMM routine from PEL AS that relies 
on the SUMMA algorithm [30] and can be modeled, asymptotically, as [rlogp, ■^]. This relies on the 
fact that, when computing a product S = AR, the PxGEMM routine selects an algorithm that reduces 
communication based on the size of the operands A, S, R. In our case, matrix A is the largest operand, 
so PxGEMM chooses an algorithm that communicates only S and R. The selection strategy is described 
in [17]. 

2. Initial distribution of the matrix along the HSS tree, as described in Section 3.2. This is a serialized 

postorder traversal of the parallel part of the tree, where, at each node t, we use the PxGEMR2D routine 

from ScaLAPACK to redistribute a block of the matrix with size Ur x rir from the p processes to the Pr 

2 

processes that work at r. The cost for one such redistribution is [p, —] for the receiving processes and 
2 

[pr, ^] for the sending processes [27]. To get the total cost, we sum over the 0{p) nodes of the parallel 
part of the tree, and we use the fact that, at level i (0 being the root node), a node r is associated with 
two blocks of the original matrix with rir = ^ rows and columns and is mapped on ^ processes. 
Each level has 2® nodes; at a given level, each process is receiver at one node (the node mapped on 
that process) and sender at 2® — 1 nodes. Therefore, the number of messages is 


logp logp ^ 

X! (l-PA (2* - 1)^) = 2plogp-p ^ — =plogp-p-C>(l) = C>(plogp) 


level 2=1 level 2=1 

Similarly, the number of words to be transferred is: 


logp 

E 

level 2=1 


p/2® p 


2®+i-I 


It 

p ^ 

level 2=1 


22i 


p p 


Therefore, the cost for the initial distribution is, asymptotically, [plogp, ^]. 

3. Postorder traversal of the tree to compute the generators. At a given node, there are three main 
ingredients: 

(a) Matrix-matrix products to compute the samples and updates. Using the above assumptions, all 
the blocks have size 0{r) x 0{r) (e.g., 2r x r). The cost is thus [rlogpr, '^^]- 

(b) Interpolative decomposition of a block of size (!I(r)x(!I(r) with rank (!I(r); the cost is [r log Pr , ] 

(using Equation (4.1) from [32] with M = N = r). 

(c) Redistribution of blocks of size 0(r) x 0(r) to the parent; the cost is [I, ^] [32]. 
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The term corresponding to the redistribution (c) is negligible compared to the two other terms, and 
the term corresponding to Interpolative Decompositions (b) dominates the term corresponding to local 
matrix-matrix products (a). We sum (b) over the critical path (a branch of the tree). This time we 
number the levels so that the leaves of the parallel tree are at level 0, and the root is at level logp. At 
level i, a node is mapped on pi = 2* processes. The number of messages is 


logp—1 logp—1 

r log Pi = r i = 0{r log^ p) 

i=l i=l 

The number of words is, similarly. 


logp-l 

E 




2 logp* 

r - 


logp-l 



2=1 


0 {^) 


We summarize the results in the following table: 


Algorithm 

Messages 

Words 

ScaLAPACK LU 

O{n\ogp) 

0 

Non-randomized HSS compression 

0 {p -1- r log^ p) 

O + rn + r^ logp^ 

Randomized HSS compression 

O(p\ogp + r logp -1- r log^p) 

dist GEMM tree 

°(v + ■» + ’■’) 

dist GEMM tree 


Table 1: Summary of communication costs. 

Now we take a closer look at the various communication costs in the randomized algorithm (last row of 
Table 1). 

• In terms of latency, the initial distribution dominates for problems with small maximum rank, while 
the traversal of the tree dominates for problems with large rank. 

• In terms of bandwidth, when the rank is large, i.e., r > the traversal of the tree dominates 

the matrix-matrix product, and the matrix-matrix product dominates the initial distribution. When 
r is small, i.e., r < 0{-^), the initial distribution dominates the matrix-matrix product, and the 
matrix-matrix product dominates the traversal of the tree. 

Comparing our randomized compression algorithm to ScaLAPACK LU, one can observe that, for prob¬ 
lems with small rank r, our algorithm communicates fewer messages and less communication volume than 
ScaLAPACK does. However, for a large rank, it can be the opposite. We illustrate this in Section 4.6. 

Comparing our randomized compression algorithm to the non-randomized one previously developed, we 
observe the following: 

• In terms of latency, our algorithm has a slightly larger complexity due to the logp in the distribution 
term and the latency of the matrix-matrix product. We are investigating a way to reduce the number 
of messages to 0{p) in the initial distribution phase. For the matrix-matrix product, we could benefit 
from advances in communication-avoiding algorithms, such as the 2.5D matrix multiplication [29]. 

• In terms of bandwidth, for both compression algorithms, the first term corresponds to the initial 
distribution of the input matrix (Step (2) above). Afterwards, in the non-randomized HSS compression, 
there is a term for the row compression {rn) and a term for the column compression (r^logp). In 
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the randomized algorithm, we have a term corresponding to the matrix-matrix product used for the 
sampling phase, and a term corresponding to the tree traversals. These terms are smaller than what 
appears in the communication cost of the non-randomized algorithm. Therefore, our randomized 
algorithm communicates fewer words, and we expect better performance in practice. We illustrate this 
in Section 4.6. 

3.5 Parallel factorization, solution, and product 

The parallelization strategy for the factorization, triangular solution, and matrix-vector product are similar 
to the one we use for the compression. We exploit both tree parallelism, using a proportional mapping of 
the tasks, and node parallelism, by using PBLAS and ScaLAPACK operations. Serial subtrees are processed 
using sequential routines written using BLAS and LAPACK. 


4 Experimental results 

4.1 Applications 

We report experimental results using the following matrices: 

• Toeplitz matrix: a matrix A = [aij] is a Toeplitz matrix (or diagonal-constant matrix) ifV(i, j), = 

Oj+ij+i. We experimented with two Toeplitz matrices. The first one is a simple matrix with at^i = 
and Qij = i — j. It is diagonally dominant and yields very low HSS rank (a small constant). The 

second one is a kinetic energy matrix from quantum chemistry [19]; ai^i = ^ and aij = where 

d is a discretization parameter (grid spacing). This matrix yields slightly larger maximum rank (that 
grows slowly with n) and is fairly ill-conditioned. This is a collaboration with D. J. Haxton and J. 
Jones at Lawrence Berkeley National Laboratory. 

• Matrices from boundary element methods: these matrices are known to be structured [18, 4]. We 
obtained matrices from G. Sylvand (Airbus), and B. Notaros and A. Manic (Colorado State University). 
The matrices represent electromagnetic spheres (or collections of spheres). These matrices are known 
to be structured although the maximum rank is often large. 

• Matrices from finite differences: it is known that the inverse of a sparse matrix arising from finite 
differences is dense and structured [18, 9]. More specifically, the dense matrices that appear during 
sparse Gaussian Elimination are structured. Different approaches have been used to exploit this fact, 
especially in the context of the multifrontal method [14]: using Block Low-Rank representations [1], 
Hierarchically Off-Diagonal Low-Rank matrices [3] and HSS techniques [33, 34, 31]. In our experiments, 
we use dense matrices coming from the sparse factorization of the discretized Helmholtz equation; these 
matrices are generated by our code Hsolver [31]. 

• Covariance matrices: spatially correlated Gaussian random fields are useful in many modeling 
applications. They can be generated by solving an eigenvalue problem with a covariance matrix [21]. 
These matrices are dense and are generally very large as they have as many degrees of freedom as 
the computational (physical) domain. However they are often compressible. We experimented with 
a covariance matrix generator provided by Panayot Vassilevski and Umberto Villa at the Lawrence 
Livermore National Lab, that relies on the MFEM code [20]. 

• "H—matrices: we use an in-house matrix generator that produces "H— matrices (i.e., they have low-rank 
off-diagonal blocks but there is no recursive relation between the different blocks) with prescribed size. 
Such matrices can be compressed using HSS techniques; even though the maximum HSS rank will be 
larger than the maximum "H—rank, the compression can sometimes be done faster, depending on the 
problem. 


21 



We use two parallel machines at the National Energy Scientific Computing Center (NERSC). Hopper is 
a Cray XE6 system with 6384 nodes; each node has two twelve-core 2.1 GHz AMD Opteron 6172 processors 
and 32 GB of main memory. Edison is a Cray XC30 system; each node has two twelve-core 2.4 GHz Intel 
Xeon E5-2695 processors and 64 GB of main memory. 


4.2 General trees 

In most of our examples and experiments, we use complete binary trees. Here, we briefly illustrate that 
our code can handle more general trees. This is important, as, in some applications, the clustering of the 
variables might not be a straightforward recursive bisection, thus the tree might not be balanced. Here, we 
use an example where the tree is a binary “comb”, i.e., for each pair siblings, only one of the siblings has 
children. Hilbert matrices exhibit such a structure [4]. 

The matrix we use is an "H—matrix with size 40,000 x 40,000. It has the structure illustrated in Eig- 
ure 6(a), corresponding to the comb tree in Eigure 6(b). 



(a) Matrix structure. 


64 


64 



700 


700 


8 


2 2 


(b) Ranks. 


(c) Uniform mapping. (d) Weighted mapping. 




Figure 6: Structured matrix with a comb-shaped clustering tree (a). HSS compression with a comb-shaped 
HSS tree yields low maximum rank (b). The tree can be mapped to processes using a uniform mapping 
where pairs of siblings are mapped on the same number of processes (c) or a mapping that assigns more 
processes to right nodes (d). For example, in (c), the children of the root node are both mapped on 32 
processes, but in (d) and they are mapped on 16 and 48 processes. 

In Table 2, we report some experiments with this matrix. We compare the effect of using a comb-shaped 
tree instead of a binary tree for the HSS compression, and we illustrate that we can modify the weights used 
in the proportional mapping to improve performance. In the first experiment, the HSS compression is based 
on a binary tree with 4 levels shown in Figure 7(b). One can easily understand why the maximum rank 
is d2200 _ XQQQQ- comes from the fact that the (2,2) block of the matrix, of size 20,000 x 20,000 is not 
structured, i.e., not HSS compressible. Its off-diagonal blocks, of size 10,000 x 10,000 are full-rank. This 
yields the maximum rank because, at the striped node in Figure 7(b), the blocks to be compressed are the 
striped blocks in Figure 7(a) and they have rank 10,000 because they contain full-rank blocks (black in the 
figure). 

In the second experiment, the HSS compression is based on a comb tree with the same structure as 
in Figure 6(b). The compression is much faster and the maximum rank is only 700 (a parameter of our 
example). 

Finally, the last experiment consists in remapping the HSS tree by modifying the weights used in the 
proportional mapping. Instead of using even weights (which yield an even splitting of processes between 
the left and right branches of the tree), we choose to attribute more processes to right nodes. Whenever a 
pair of siblings is mapped, the right child inherits from 75% of the processes working at its parent. This is 
motivated by the fact that, in the factorization, we have to perform the LQ factorization of a 20,000 x 20,000 
block (corresponding to the (2,2) block of the original matrix), which is the most costly operation in the 
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Figure 7: Matrix from Figure 6 compressed using a complete binary tree. 


factorization and corresponds to the right child of the root node. By simply changing the weights in the 
mapping procedure using some knowledge of the input matrix, we significantly speed-up the factorization 
(65.8 seconds instead of 101.4 seconds), at the price of a small increase in the compression time (19.5 seconds 
instead of 14.1). 



Binary tree 

Comb tree 

Comb tree, remapped 

HSS compression time (s) 

704.1 

14.1 

19.5 

Maximum rank 

10000 

700 

700 

ULV factorization time (s) 

122.0 

101.4 

65.8 


Table 2: Experiments with the structured matrix in Figure 6 using 64 MPl tasks. 

This simple example illustrates that our code is flexible. It can handle different tree structures, and 
different task-to-process mappings, using any number of processes. In Hsolver, this was not the case. The 
code was restricted to some specific problems, relied on complete binary trees, and could only work with 
power-of-two number of processes. 

4.3 Solving linear systems 

In this section, we illustrate the performance of our code for seven different matrices from the abovementioned 
applications. The results are reported in Table 3. For each matrix, we experiment with four different 
compression thresholds (e = 10“®, 10“®, 10“"'^, 10“^) and we report statistics for the HSS compression, the 
ULV factorization and the triangular solution with iterative refinement. We provide run time, number of 
floating-point operations, size of the HSS/ULV factors, and we compare with the run time for solving a system 
with ScaLAPACK. In terms of memory, ignoring small temporary storage and communication buffers, the 
memory footprint for ScaLAPACK is simply the storage of the matrix A. In our new code, the memory 
usage consists not only of the original matrix, but also storage for the random vectors and the samples, and 
storage for the HSS and ULV factors. We report a memory overhead, which represents the extra memory 
usage of STRUMPACK relative to that of ScaLAPACK. 

Among this collection of matrices, the data type for the two matrices BEMMultiSphere and SCHURlOO 
is single precision. The other matrices are of double precision. Eor the single precision input, the compression 
threshold 10“® is very small, leading to almost no compression. Eor example, for matrix SCHURlOO, the 
(1,2) block is of size 5,000, whereas the maximum rank is 4933 with e = 10“®, which is essentially full rank. 
The memory overhead is much larger with HSS representation. This is mostly due to the ULV factors. 
Indeed, the special structure of the U and V generators keep memory usage low when blocks are full-rank, 
and the HSS form has roughly the same memory footprint as the input matrix. However, in this situation, 
ULV factors are usually much larger than the HSS form. Therefore, the practical use of HSS algorithms is 
not with very small compression threshold e. 
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Matrix 

Size 

e 


Solution with 

Comparison: HSS 







ScaLAPACK 

vs ScaLAPACK 




Max 

Factors 

Flops 

Time 

Factors 

Flops 


Flops 


Flops 

Time 

Memory 

Time 




rank 

(MB) 

(xlO®) 

(s) 

(MB) 

(xlO®) 


(xlO®) 


(xlO^^) 

(s) 

overhead 

speedup 



10 “® 

2 

14.6 

307.3 

11.4 

37.2 

mm 



0.2 



0 .1% 

75.2 

Simple 

80,000 

10 “® 

2 

14.2 

307.3 

11.3 

37.0 

mm 

lilBIiH 


1.2 

341.3 

856.6 

0 .1% 

68.1 

Toeplitz 

lO-'' 

3 

13.3 

307.3 

11.3 

36.3 

mm 



2.8 

0 .1% 

60.5 



HiB 

■ 

13.2 

307.3 

11.3 

36.2 

0.09 

0.006 

0.088 

5.4 



0 .1% 

51.0 





55.1 

6411.2 

19.0 

152.7 



1.0 

1.5 



1.5% 

43.5 

QChem 

80,000 



42.1 

5126.6 

17.6 

110.0 

0.86 

0.03 

0.9 

2.0 

341.3 

894.1 


45.5 

Toeplitz 

lO-'' 

120 

33.3 

3843.7 

16.7 

83.6 

mm 

0.02 

1.7 

5.6 

1 .0% 

40.0 



10 “^ 

30 

18.1 

1280.6 

13.4 

42.7 

mm 

0.01 

N/A 

N/A 



0.5% 

N/A 





2235.3 

24707.5 

52.5 


1865.0 

3.5 

2.1 

0.22 



10.3% 

15.3 

HMatrix 

80,000 

10 -® 

785 

2263.4 

24723.3 

53.8 


1897.9 

3.5 

4.3 


341.3 

862.1 

10.3% 

14.9 

lO-'* 

4 

14.3 

409.7 

11.1 


0.1 

0.006 

0.02 

m 

0 .2% 

72.9 



HiB 

■ 

13.2 

409.7 

11.1 

36.2 

0.1 

0.006 

0.02 

1.26 



0 .2% 

69.7 






972.4 



401.8 

3.1 





120 .1% 

0.3 

BEM 

10,002 



507.6 

624.3 

4.7 

1265.5 

172.6 

1.7 

0.5 

0.1 


3.3 

82.4% 

0.5 

Acoustics 



420.0 

453.6 

3.3 


109.2 

1.1 

0.7 

0.3 

66 .6% 

0.7 



10 “^ 



243.1 


667.4 







44.9% 

N/A 

BEM 

Multi 

Sphere 

27,648 



4489.0 

38980.1 

354.5 

15897.3 

26406.2 


8.9 

0.9 



403.2% 

0.07 

(Single 

prec.) 

10 -® 

2145 

811.7 

8499.1 

29.1 

2568.9 

1015.6 


1.5 

0.5 

14.1 

30.7 

53.5% 

0.9 

lO-'* 

1488 

425.6 

5396.0 

15.0 

1237.9 

322.8 


1.0 

0.7 

38.4% 

1.8 





8.5 

495.3 

52.2 


1.0 




25.2% 

2.9 


10,000 

10 “® 

4933 

763.0 

4520.4 

98.3 

2957.1 

2928.8 


16.3 

3.6 



722.6% 

0.03 

SCHURlOO 

(Single 

prec.) 

10 “® 

840 

136.2 

601.0 

6.3 

388.3 

61.0 


2.2 

1.7 

0.7 

4.2 

76.6% 

0.5 



90.3 

278.3 

3.2 

235.2 

23.0 


1.3 

1.4 

40.5% 

0.8 


10-2 


56.6 

153.0 

2.0 

134.0 

7.4 


1.3 

2.5 



25.6% 

0.9 





1221.6 

8976.3 

34.3 


1515.2 


0.4 

0.1 



61.1% 

0.9 

CovarSO 

27,000 

10 -® 

1609 

948.2 

6363.7 

18.6 


815.8 


3.2 

0.8 

13.1 

36.8 

47.7% 

1.7 

lO-'* 

215 

380.8 

1093.2 

3.4 


212.6 


N/A 

N/A 

14.3% 

N/A 



10-2 

3 

348.0 

49.6 

1.9 


205.4 


N/A 

N/A 



6.7% 

N/A 


Table 3: Solving linear systems from different applications using 64 MPI tasks. 


All the problems exhibit the same - and expected - behavior. When the compression threshold e is higher 
(e.g., 10“^), the compression and the factorization are faster and the HSS and ULV factors are smaller than 
when e is closer to machine precision. The gains in compression and factorization come at the price of 
accuracy; for some problems, the solution is inaccurate when e is too large. This is the case for matrices 
QChemToeplitz, BEMAcoustics and CovarSO. For some other problems, accuracy is satisfying with 
the largest value of e, but the best choice of e is not 10“^. For example, for problem Hmatrix, the best 
choice is e = 10“^. 

We now compare the behavior of our dense solver with ScaLAPACK. The last column in Table 3 is 
the speed-up of STRUMPACK with respect to ScaLAPACK. For synthetic problems (SimpleToeplitz, 
HMatrix) and problems with a very simple structure (QChemToeplitz), using HSS techniques yields 
very large gains. For example, for the HMatrix problem and £ = lO”"*, our solution process is 72.9 times 
faster than a traditional dense LU factorization. For problems BEMMultiSphere and Covar 30 , the gains 
are less impressive but still significant; STRUMPACK exhibits a 2.9x speed-up for BEMMultiSphere and 
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a 1.7x speed-up in run time for CovarSO. For the last two problems, SchurIOO and BEMAcoustics, 
STRUMPACK is slower than ScaLAPACK regardless of the parameters. These two matrices exhibit some 
low-rank property and the number of floating-point operations performed by STRUMPACK is lower than that 
of ScaLAPACK, but, however, the total run time is larger with STRUMPACK. This highlights a drawback 
of our approach. Traditional dense LU factorization is an algorithm with a very regular computational 
pattern, than can be written with BLASS kernels, and good implementations (e.g., vendor-tuned) usually 
exhibit very good flop-rate and can reach a very large fraction of the peak performance. On the other hand, 
algorithms that takes advantage of low-rank structures (e.g., HSS, but also "H-matrices or Block Low-Rank 
representations) have to deal with more irregular and imbalanced task flows, and manipulate a collection 
of small matrices instead of one large matrix. Therefore, these algorithms cannot be expected to reach the 
same flop rate as traditional algorithms. This is visible in Table 3. 

We want to highlight that for a given class of applications, using low-rank approximation techniques 
usually pay off past a certain size. This is because, although HSS techniques allow to solve linear systems 
with a lower asymptotic complexity, but with a larger constant prefactor. Also, as we mentioned previously, 
the flop rate with HSS is often lower than with traditional algorithms. These effects are visible in Section 4.6 
where we experiment with matrices of growing size from a particular application; in this framework, gains 
increase with problem size. 

The last point that we elaborate on is memory. As stated previously, the memory overhead is the 
extra memory usage of STRUMPACK relative to that of ScaLAPACK; calling memsca the memory usage 
of ScaLAPACK and merristr the memory usage of STRUMPACK, this is simply ■ It is 

important to understand that this memory overhead also represents the amount of memory that would 
be used if we were to use a matrix-free implementation. We recall that our algorithm is amenable to a 
matrix-free framework since it only requires access to a matrix-vector routine and selected elements, more 
specifically 0 {r^n), elements of the matrix (with r the maximum rank and assuming the tree has logn 
levels). For example, for matrix BEMMultiSphere, the memory overhead is 25.2%, which means: 

1. The memory consumption of STRUMPACK is 1.252 times that of ScaLAPACK. 

2. If we were to use a matrix-free version the memory consumption would be 25.2% that of ScaLAPACK, 
i.e., a 4-fold reduction. 

4.4 Fast matrix-vector product 

In this section, we briefly illustrate the use of HSS techniques for fast matrix-vector products. Here, the 
matrix is not factored with ULV but is simply kept in HSS form to perform matrix-vector multiplication. 
We use the power method (that relies mainly on matrix-vector products) to compute the largest eigenvalue 
of the QChemToeplitz matrix. 


HSS 

Traditional GEMV 

Speed-up 
with HSS 

Compression 

Iterations 

Iterations 

Max rank 

Factors (MB) 

Flops (xlO'-^) 

Time 

#It 

Flops (xlO'J) 

Time 

#It 

Flops 

Time 


147 

42.1 

5126.6 

17.6 

318 

1053.0 

21.8 

318 

4070.4 

69.7 

1.8 


Table 4: Power method for QChemToeplitz using 64 MPI tasks. 


4.5 Adaptive-rank mechanism 

In this section, we illustrate the behavior of the adaptive sampling mechanism described in Section 3.3. The 
matrix we use corresponds to an electromagnetic sphere discretized with the boundary element method and 
has size 58,800. 

In Figure 8, we examine three configurations. In Figure 8(c), we use 3,000 random vectors, which is 
enough to guarantee that we reveal the “true” rank of each node (in this discussion we only consider the 
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(a) In steps of 500. 


Rand. 
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2000 

3000 


Rand. 

1000 

2000 

3000 



- 2560 2648 



(b) In steps of 1,000. 


(c) No adaptive sampling. 


Figure 8: Ranks of the U generators using adaptive sampling in steps of 500 (a), in steps of 1,000 (b), and 
without adaptive sampling (c). An underlined rank means that the corresponding node triggers a restart. 
A rank marked in bold is final. 


rank of the U generator, for simplicity). In Figure 8(a), we start the compression with 500 random vectors. 
Every time a block is compressed, we look at the difference between its rank and the number of random 
vectors; if it was less than 200, we discarded the generators at the node we consider, we add 500 new random 
vectors, and the compression restarts. In this example, the four leaves of the tree are compressed in parallel; 
they all have rank 500, and the compression restarts with 500 + 500 = 1,000 random vectors. Again, this is 
not enough (rank 1000 is found at the leaves), and we add 500 new random vectors. 1,000 random vectors 
is not enough and the compression restarts with 1,000 + 500 = 1, 500 random vectors; this is is enough, and 
the compression restarts with 1500 + 500 = 2000 random vectors. This time, the leaves exhibit four different 
ranks: 1,592, 1,982, 1,993, and 1,660. At the leaves with 1,592 and 1,660, the generators are kept because 
the difference between their rank and the number of random vectors is more than the limit that we picked. 
However, the compression needs to restart because of the two other leaves. We add 500 random vectors 
and the traversal restarts. At the leaves that have rank 1,592 and 1,660, we simply update the samples 
and S^; their generators have been obtained at the previous iteration and are not recomputed. At the two 
other leaves, we recompute the generators. This fails again and the compression restarts with 3,000 random 
vectors. This time the ranks are small enough and the generators are kept. The compression proceeds to 
the next level then terminates. 

In Figure 8(b), we start with 1,000 random vectors and we add 1,000 random vectors whenever a step 
of compression fails; this time, the compression restarts only twice and successfully terminates with 3,000 
random vectors. One can observe that the ranks that we obtain using different numbers of random vectors 
vary slightly. This is an effect of the sampling mechanism, but it does not have any major effect on accuracy, 
or the size of the HSS and ULV representations. However, the adaptive sampling mechanism influences 
performance. In Table 5, we report on the run time for the HSS compression when the tree has 3 levels (as 
in Figure 8) and when the tree has 8 levels. One can observe that when the HSS tree has 3 levels, using 
the adaptive sampling mechanism induces a penalty in run time. This is due to the fact that processes need 
to synchronize to restart the computations, and the HSS tree has to be traversed multiple times instead of 
once. Also, instead of being computed in one shot, the samples S’’’ = AR^ and S‘^ = A*R'^ are computed in 
multiples stages, which mitigates the benefits of BLAS3 kernels. However, when the tree has more levels, we 
can see that the adaptive strategy can be faster than using directly the correct number of random vectors 
(3,000 here). This is due to the fact that, at the bottom of the tree, nodes have ranks much lower than 
3,000. Their generators can be computed with less random vectors (e.g., 500 or 1,000). Using less random 
vectors makes the Interpolative Decomposition faster, and it can potentially make the whole compression 
stage faster. 
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Levels 

Strategy 

in HSS 

Steps of 

Steps of 

No adaptive 

tree 

500 

1,000 

sampling 

3 

259.9 

223.9 

186.2 

8 

147.0 

125.1 

137.3 


Table 5: Time in seconds for the HSS compression as a function of the number of levels in the HSS tree 
and the sampling strategy (as in Figure 8). The matrix arises from the discretization of an electromagnetic 
sphere using BEM and has size 58,800. 


In a practical setting, it is impossible to predict what the fastest strategy is. However, in many applica¬ 
tions, practitioners have a rough idea of the compressibility of their matrices and can predict the order of 
magnitude of the maximum rank. In that case, we advise to set the sampling parameters so that, at worst, 
the compression routine needs to restart a limited number of times. For example, if the rank is expected to 
be between 1,000 and 10,000, we would start with 1,000 random vectors and increase the number by 1,000 
every time a step of compression fails, guaranteeing no more than 10 steps. 

4.6 Scalability 

In this section, we evaluate the scalability of our structured code using three experiments. 

In the first experiment, we process dense matrices with increasing size from the same application, using 
an increasing number of MPI processes. The experimental setting is the same as in [32]; we use the same 
system (Hopper at NERSC) and the same settings. The matrices we consider correspond to the root node 
of the multifrontal factorization of the discretized Helmholtz equations, with a fixed number of points per 
wavelength. They correspond to five different cubic meshes, ranging from 100 x 100 x 100 to 500 x 500 x 500. 
The topmost separators, i.e., the dense matrices we consider here, have between 10,000 and 250,000 rows 
and columns. In Table 6, we compare the performance of ScaLAPACK, Hsolver (more precisely, the dense 
kernel used within Hsolver), and STRUMPACK when solving a linear system with these matrices. Under 
this setup, the maximum HSS rank grows linearly with the mesh size k. Note that this is not strictly a weak 
scaling experiment since the number of processes does not increase as fast as the number of operations. The 
next experiment in this section is a strict weak scaling experiment. 

One can observe that STRUMPACK and Hsolver find similar maximum ranks for all the problems. 
The size of HSS factors is smaller with STRUMPACK, which is due to the special structure of U and V 
generators, as described in Section 2.2. In terms of performance, STRUMPACK is 2 to 6 times faster than 
Hsolver, and 1.8 to 5.4 faster than ScaLAPACK. It is interesting to notice that STRUMPACK spends less 
time in communication. The percentage of wall time spent in communications, as reported by the IPM 
tool [15], is similar to that of Hsolver, but the overall wall time is shorter, implying less time is spent 
doing communication (assuming computations and communications do not overlap, which is fair since our 
algorithm is mostly synchronous). 

The second experiment in this section is a strict weak-scaling experiment. We consider the root node 
of the multifrontal factorization of the discretized Poisson equation on a 2D mesh. The mesh is a k x k 
regular grid, therefore the dense matrix that we consider (last frontal matrix) is k x k. The multifrontal 
factorization yields frontal matrices that can be compressed using HSS techniques with a very low maximum 
rank [33], that is almost constant with respect to the size of the grid (in practice, it increases very slowly 
- logarithmically). In the experiment, we use a fixed number of random vectors (slightly larger than the 
rank of the largest problem). Therefore, the complexity of the HSS compression grows as 0{k^). In the 
experiment, the number of processes also grows as yielding a constant number of operations per process 
for the different grid sizes, as shown in Table 7. One can observe that the run time increases as k increases. 
This is due to the overhead of communications. In particular, for the last problem, the redistribution of 
the input matrix represents over 80% of the compression time. This is what is expected for problems with 
very small maximum rank, as shown in Section 3.4, Table 1. If we ignore the redistribution time, then 
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k (mesh: k x k x k) 

100 

200 

300 

400 

500 

Matrix size {=k‘^) 

10,000 

40,000 

90,000 

160,000 

250,000 

MPI tasks 

64 

256 

1,024 

4,096 

8,192 

Sea 

LAPACK 

Flops (xlO^^) 

2.7 

170.7 

1944.0 

10922.7 

41666.7 

Time (s) 

4.2 

57.7 

176.1 

313.6 

541.6 

Communication time 

30.5% 

20.5% 

24.9% 

40.4% 

36.1% 

Hsolver 

Maximum rank 

335 

618 

894 

1226 

1497 

HSS factors (GB) 

0.1 

0.8 

2.0 

4.6 

6.8 

Compression flops (xlO^^) 

0.8 

19.7 

115.2 

424.0 

1051.0 

Compression time (s) 

8.3 

51.5 

193.4 

207.8 

259.5 

Factorization flops (xlO^^) 

0.1 

0.7 

2.3 

7.2 

10.6 

Factorization time (s) 

0.4 

1.4 

1.8 

2.5 

4.2 

Solution flops (xlO®) 

0.1 

0.3 

0.8 

2.1 

2.9 

Solution time (s) 

0.1 

0.2 

0.6 

2.3 

9.5 

Communication time 

12.4% 

19.4% 

27.7% 

26.4% 

31.3% 

Speed-up over ScaLAPACK 

0.5 

1.1 

0.9 

1.5 

2.0 

STRUMPACK 

Maximum rank 

313 

638 

903 

1289 

1625 

HSS factors (GB) 

0.1 

0.5 

1.1 

3.0 

3.3 

Gompression flops (xlO^^) 

0.6 

18.8 

132.7 

626.1 

1716.7 

Gompression time (s) 

2.0 

13.0 

30.6 

60.8 

133.6 

Factorization flops (xlO^^) 

0.04 

0.5 

1.7 

5.9 

7.7 

Factorization time (s) 

0.3 

1.0 

1.5 

2.7 

5.0 

Solution flops (xlO®) 

0.2 

1.1 

2.9 

6.9 

9.5 

Solution time (s) 

0.04 

0.3 

0.4 

0.6 

0.8 

Gommunication time 

24.2% 

24.0% 

27.0% 

28.5% 

28.9% 

Speed-up over ScaLAPAGK 

1.8 

4.0 

5.4 

4.8 

3.9 

Speed-up over Hsolver 

3.8 

3.7 

6.0 

3.3 

2.0 


Table 6: Comparison between ScaLAPACK, Hsolver, and STRUMPACK for dense matrices arising from the 
multifrontal factorization of the discretized Helmholtz equations. 


compression time is reasonably constant, as shown in the last row of the table. 

The last experiment in this section is a strong scaling benchmark. We use one test problem, a matrix 
arising from the discretization of an electromagnetic sphere using BEM, with size 130,000. We compare the 
run time for solving a linear system with ScaLAPACK and STRUMPACK, using a number of MPI processes 
ranging from 256 to 4,096. For this problem, the maximum rank is 5,500. 

We observe that although the scalability of STRUMPACK is quite good, the gap between ScaLAPACK 
and STRUMPACK reduces when the number of processes increases. As explained in Section 3.4, this is 
because when the HSS rank is large (which is the case in this problem), communication volume for the 
traversal of the tree becomes larger than that with ScaLAPACK. The breakdown of the run time for the 
parallel HSS compression with 4,096 MPI tasks is the following: 15% of the time is spent in the initial matrix 
distribution, and 25% of the time is spent in the two products = AR'' and = A*R^. The rest (60%) 
is spent traversing the HSS tree to compute the local samples and generators. The major part is spent 
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k (matrix size: k x k) 

2,000 

4,000 

8,000 

16,000 

32,000 

MPI tasks 

1 

4 

16 

64 

256 

Maximum rank 

21 

26 

31 

38 

44 

Compression flops per process (xlO®) 

1.08 

1.06 

1.06 

1.06 

1.05 

Compression time (s) 

0.10 

0.21 

0.37 

0.54 

2.51 

Compression time w/o redistribution (s) 

0.10 

0.11 

0.13 

0.20 

0.42 


Table 7: Weak-scaling experiment for dense matrices arising from the multifrontal factorization of the 
discretized 2D Poisson equation. 
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MPI tasks 

256 

512 

1,024 

2,048 

4,096 

LU 

Time (s) 

862.4 

466.8 

257.4 

157.7 

132.9 

% comm. 

30.5% 

37.9% 

45.0% 

55.9% 

74.3% 

HSS 

Time (s) 

469.7 

233.6 

176.3 

130.2 

121.9 

% comm. 

34.7% 

31.2% 

42.1% 

47.3% 

68.0% 


(b) Statistics. 


Figure 9: Strong scaling experiment: run time for solving a linear system for a matrix with size 130,000, 
arising from the discretization of an electromagnetic sphere using BEM, with maximum rank 5,500. 


computing Interpolative Decompositions, which represent 50% of total compression time. We observed that 
in most cases the flop-rate of the Interpolative Decomposition (modihed version of PxGEQPF) is much lower 
than that of PxGEMM or PxGETRF. This is due to the fact that it relies on a BLAS2 algorithm. A BLASS 
implementation appears in the literature but the code is not publicly available [5]. Implementing a BLASS 
version is left for future work. 

We are investigating different techniques to accelerate the distribution of the input matrix A. Further¬ 
more, some recent works investigate communication optimal matrix-matrix multiplication algorithms [29] 
and improvements for rectangular matrix multiplications [13]. Our implementation would directly benefit 
from any improvement resulting from this research. 


5 Conclusion 

We presented the dense matrix computation package STRUMPACK that uses Hierarchically Semi-Separable 
representations to compress an input matrix and performs operations with this compressed form, such as 
solving linear systems or performing matrix-vector products. For matrices from certain classes of applications, 
such as finite element or boundary element methods, or applications that involve Toeplitz matrices, using 
HSS techniques allows to perform these operations asymptotically faster than when traditional algorithms 
(e.g., LU factorization) are used. The compression algorithm, which is the cornerstone of the framework, 
is parametrized by a compression threshold that allows the package to be used as a direct solver with full 
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accuracy or as a robust preconditioner. Our compression algorithm employs randomized sampling and is 
the first distributed-memory implementation that we know of. Furthermore, we introduced an adaptive 
sampling mechanism that allows the code to be used in a black-box fashion. 

The STRUMPACK package is very general; it can be used with any number of MPI processes and can 
accommodate different hierarchical partitionings of the input matrix. Furthermore, it is an open source pack¬ 
age made available to the community. The code is released under the BSD-LBNL license and the version pre¬ 
sented here is currently available at http: //portal .nersc .gov/project/sparse/strumpack/STRUMPACK-Dense-0.9.0. tar 

Work is in progress to use this dense package within a sparse solver. We have also developed a 
shared-memory sparse solver [16] and our goal is to combine these two codes in order to obtain a hy¬ 
brid (MPI-|-OpenMP) sparse solver. Another aspect that we wish to explore is using HSS techniques in 
matrix-free frameworks. As mentioned here, our algorithm is amenable to a matrix-free implementation 
where the user only provides a matrix-vector product and a routine to access selected elements of the matrix 
on the fly. This feature will be included in a future version of STRUMPACK. 
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A Two-stage triangular solution process 

This appendix illustrates the ULV solve algorithm 4 starting from expression (11), giving the explicit ULV 
factorization of A for a 3 level HSS matrix. After ULV factorization, the solution of Ax = b can be obtained 
as a; = V~^L~^U~^b. In (11), the transformations applied to A from the left form U~^, the transformations 
applied to A from the right form V~^ and the big matrix in the right-hand side of Equation (11) forms L. 
Define 

^ Vr, if T is a leaf 

Vuub 

^2;b 

Let br = 6(/t) for leaves t and b^. = Now, we first compute U~^b by applying the fir transformations 

and the permutations to the right-hand side b 



Vr = QrVr with Vr = 



We write down explicitly the forward triangular substitution y = L ^b 



J/3 — 1^3 
J/4 = L4 ^^4;t 

2/1 = ^ — L4^3j/3 — ^3,42/4^ 

2/5 = L'^^b^-t 
2/6 = Lq 

2/2 = L2 ^ — L^^Qy^ 

^4,32/3 - ^3,42/4^ - Wi-bQi-tVl - ^1,2^2* 2/6 

/ L '/6;5j Q*y^^ 

\ \v* V* A 

Le,5y5 — L5fiye ) — W2-bQ2-,ty2 — B2^iV{ ~ ~ 2/4 

/ L '/4;5j 

(15) 
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Clearly, this substitution should be performed bottom-up, i.e., first compute the leaves j/s, j/4 j/s and y^, 
then yi and y 2 and finally t/o- Now we introduce the intermediate variable Zt, defined as 


Then for a non-leaf node 
yr = L~^VlT-^tbT with 


V*.^yT, if T is a leaf 


V* -I- V*.^yr, if T is a non-leaf 


T (f.i., nodes 1 and 2), with two children i/i and 1^2 which are both leaves. 


(16) 

we have 


br 







= 


bi-i^b 

bl>2\b ^'^2\bQl/2;ty'^2. 


bvi-,b B, 2 i,l> 2 ^i, 2 Ql> 2 -,ty '^2 

K.b - - W,2,.bQl,,tyi22_ 


(17) 

(18) 


Due to the definition of Zr as given in (16), the definition of br for non-leaf nodes. Equation (18), is also 
valid for nodes higher up in the hierarchy, for which the situation is slightly more complicated. Consider 
node 0, for which yo = DQ^bo, with 


bo = 


bi-b 

b2-b 

bi-b 

&2;6 


W^,bQl,^y^ - B 1.2 I C 2 * 

W2,bQ%ty2 - B2,1 ( Cl* 



Vo*M 


VZtVA, 


■Co* 


■Cl* 


C* 


5:6 


C* 


3:6 



V* 

M;6 



Wi-bQl.yi - Si,2 1 

[v2* 

25 

.Z6_ 

+ ^2-ty^ 


W2-bQl.ty2 - S2,l 1 

[Ci* 

Z3 

_Z4_ 




bl;b — ICiihQi.jJ/i — Si,2^2 

.^2;6 — W2-bQ2-,ty2 — -B2,l2l 


(19) 


( 20 ) 


Hence, the Zr variables accumulate the contributions to the right-hand side from the already eliminated HSS 
nodes. We can compute yr as yr = Lr^K, except at the root where yo = D^^bo, which is computed using 
standard LU decomposition of Dq. Finally, the orthogonal transformation V~^ involving the Qr matrices 
should be applied to y to obtain the solution vector x. 
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